City College of San Francisco


MATH 108 - Foundations of Data Science

Lecture 36: Updating Predictions¶

Associated Textbook Sections: 18.0 - 18.2

Outline¶

  • Decisions
  • Conditional Probability
  • Tree Diagrams
  • Bayes' Rule
  • Subjective Probabilities

Set Up the Notebook¶

In [1]:
from datascience import *
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Decisions¶

Decisions Under Uncertainty¶

Interpretation by Physicians of Clinical Laboratory Results (1978)

We asked 20 house officers, 20 fourth-year medical students and 20 attending physicians, selected in 67 consecutive hallway encounters at four Harvard Medical School teaching hospitals, the following question:

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming that you know nothing about the person's symptoms or signs?

Eleven of 60 participants, or 18%, gave the correct answer. These participants included four of 20 fourth-year students, three of 20 residents in internal medicine and four of 20 attending physicians. The most common answer, given by 27, was that [the chance that a person found to have a positive result actually has the disease] was 95%.

Medical Testing Scenario¶

  • Rare disease with prevalence of 1/1000 in population
  • There is a test (e.g., antigen test) with the following properties
    • False Positive Rate of 5%: If you do NOT have the disease, then 5% of the time the test says you do.
    • False Negative Rate of 1%: If you DO have the disease, then 1% of the time the test says you do not.
  • If you sample a person at random and they test positive, what is the chance they have the rare disease?
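One way to see where the answer comes from, before any formulas, is to count outcomes in a large hypothetical population. The sketch below (plain Python; the population size of 100,000 is an assumption chosen only for illustration) tallies true and false positives directly:

```python
# A hypothetical population of 100,000 people (size chosen for illustration)
population = 100_000
diseased = population // 1000             # prevalence 1/1000 -> 100 people have the disease
healthy = population - diseased           # 99,900 people do not

true_positives = diseased * 0.99          # 1% false negative rate: 99% of diseased test positive
false_positives = healthy * 0.05          # 5% false positive rate: 5% of healthy test positive

# Among everyone who tests positive, what fraction actually has the disease?
p_disease_given_positive = true_positives / (true_positives + false_positives)
print(p_disease_given_positive)           # about 0.0194, i.e. roughly 2%
```

The false positives from the large healthy group (about 4,995 people) swamp the true positives from the small diseased group (about 99 people), which is why the answer is closer to 2% than to 95%.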

Truth and Test Results¶

All patients fall into one of 4 categories:

Table showing the 4 possible outcomes for the patient.

False Positive Rate¶

Same table, focusing on the row where the test indicated positive but the patient does not have the disease.

False Negative Rate¶

Same table, focusing on the row where the test indicated negative but the patient does have the disease.

Another Scenario¶

  • Class consists of Freshmen (60%) and Sophomores (40%)
  • Some of the students have declared their major
    • 50% of the Freshmen have declared their major
    • 80% of the Sophomores have declared their major
  • I pick one student at random ...
  • That student has declared a major!
  • Which is more likely: Freshman or Sophomore?
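This question can also be settled by counting, the same way as the medical scenario. A sketch assuming a class of 100 students (the size is arbitrary; only the proportions matter):

```python
# An illustrative class of 100 students; any class size gives the same proportions
freshmen = 60
sophomores = 40

declared_freshmen = freshmen * 0.50       # 50% of Freshmen have declared -> 30
declared_sophomores = sophomores * 0.80   # 80% of Sophomores have declared -> 32

total_declared = declared_freshmen + declared_sophomores

# Among declared students, compare the two years
print(declared_freshmen / total_declared)     # about 0.48: P(Freshman | Declared)
print(declared_sophomores / total_declared)   # about 0.52: P(Sophomore | Declared)
```

Even though Freshmen outnumber Sophomores, the declared student is slightly more likely to be a Sophomore, because Sophomores declare at a much higher rate.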

What do these scenarios have in common?¶

  • There is some chance event that I am interested in
    • person has a disease
    • the student's year
  • I start with some prior (before observing anything) information about that quantity: P(Disease) or P(Year)
  • I then observe something whose value depends probabilistically on the original chance event
    • Test is Positive; student has Declared
    • Neither exactly determines the original event
  • How do I update the probability of the original event given the additional information?

Conditional Probability¶

Conditional Probability¶

Probability of an event given some information (it is conditioned on the information). Example:

  • “80% of sophomores are Declared”
  • P(Declared | Sophomore) = 0.8 <--- Notation

Conditional vs Joint Probabilities¶

  • Recall the joint probability of two events:
    • P(Declared, Sophomore) = chance of a random student being declared and a Sophomore
  • Conditional probability (the stuff after | is given):
    • P(Declared | Sophomore) = chance of a random Sophomore student being declared
  • Which one is bigger?

Answer: the conditional; we will see why in a moment.
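The multiplication rule makes the comparison concrete: the joint probability equals the conditional probability multiplied by another probability (which is at most 1), so the joint can never be bigger. A minimal numeric check using the class scenario above:

```python
p_sophomore = 0.40                     # 40% of the class are Sophomores
p_declared_given_sophomore = 0.80      # 80% of Sophomores have declared

# Multiplication rule: P(Declared, Sophomore) = P(Sophomore) * P(Declared | Sophomore)
p_declared_and_sophomore = p_sophomore * p_declared_given_sophomore

print(p_declared_and_sophomore)        # about 0.32 -- the joint probability
print(p_declared_given_sophomore)      # 0.8 -- the conditional, always at least as big
```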

An Example¶

In [2]:
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vRiLsFDsuuT\
_fGEkjNJJ5Yv6MdEkWshYniIDyrzR4F4vN7UkAUgwT-MrhUTy8_gxwyhLv3rTleNScXw\
/embed?start=false&loop=false&delayms=3000', 960, 569)
Out[2]:

Tree Diagrams¶

Tree Diagrams¶

In [3]:
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vTYqt2\
-0qckaBNAHfug29S4o0IV-tCrPkOp3a01wWsx65iyAmpFX3gI9ROkaZ21Syf77\
xyiIIDrGAgS/embed?start=false&loop=false&delayms=3000', 960, 569)
Out[3]:

Bayes' Rule¶

Bayes' Rule¶

In [4]:
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vSTI_AHfonqA-\
ww_uTioJOpF_sy8PHvEkaZ1B0ahy-KdKXygejBtQeQpIACZ0xNLnEYCfTbfkSC3Klw/\
embed?start=false&loop=false&delayms=3000', 960, 569)
Out[4]:

A Closer Look at the Answer¶

Assume a patient is picked at random.

  • Prior probability of disease
    • P(Disease) = 0.001 = one-tenth of 1%
  • Posterior probability of disease given positive test
    • P(Disease | Test positive) = 0.0194... ≅ 2%
  • Bigger than the prior, but still pretty small
  • Should we approve such a test?
    • The test has low error rates compared to most tests
  • How can this be?
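Plugging the scenario's numbers into Bayes' Rule shows where the small posterior comes from: the huge healthy pool contributes far more probability of a positive test than the tiny diseased pool does.

```latex
P(\text{Disease} \mid \text{Positive})
= \frac{0.001 \times 0.99}{0.001 \times 0.99 + 0.999 \times 0.05}
= \frac{0.00099}{0.05094}
\approx 0.0194
```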

Assumptions Matter¶

  • "Assume a patient is picked at random."
    • But usually, people aren’t picked at random for medical tests
    • So our intuition about randomly picked patients may not be great
  • For a randomly picked patient, the result does make sense, because the disease is very rare.
  • What if the doctor believes there is a 10% chance the patient has the disease?

Bayes' Rule and Covid Testing¶

Image Source: The obscure maths theorem that governs the reliability of Covid testing - The Guardian

Demo: Bayes' Rule¶

Create a function that calculates $P(A \mid B) = \frac{P(A) \cdot P(B\mid A)}{P(B)}$

In [5]:
def bayes_rule(pr_a, pr_b_given_a, pr_b_given_not_a):
    """
    Bayes' Rule
    P(A | B) = P(A)P(B|A) / P(B)
    
    To Compute P(B)
        P(B) = P(B, A) + P(B, Not A) 
             = P(A)P(B|A) + P(Not A)P(B | Not A)
    """
    pr_b = pr_a * pr_b_given_a + (1 - pr_a) * pr_b_given_not_a
    return (pr_a * pr_b_given_a) / pr_b

Use bayes_rule to calculate the probability for the original medical question.

In [6]:
pr_disease = 1/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05

bayes_rule(pr_disease, pr_pos_given_disease, pr_pos_given_no_disease)
Out[6]:
0.019434628975265017

How does the conditional probability change when the prior is larger?

In [7]:
# updating with a subjective prior of 1%

pr_disease_update = 10/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05

bayes_rule(pr_disease_update, pr_pos_given_disease, pr_pos_given_no_disease)
Out[7]:
0.16666666666666669
In [8]:
# updating with a subjective prior of 10%

pr_disease_update2 = 100/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05

bayes_rule(pr_disease_update2, pr_pos_given_disease, pr_pos_given_no_disease)
Out[8]:
0.6875

Notice how quickly the Posterior probability climbs as the Prior probability increases.

In [9]:
pr_disease = np.arange(1,999)/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05

post = bayes_rule(pr_disease, pr_pos_given_disease, pr_pos_given_no_disease)
Table().with_columns(
    "Prior Pr(Disease)", pr_disease, 
    "Posterior Pr(Disease | Pos. Test)", post).iplot("Prior Pr(Disease)")

Subjective Probabilities¶

Subjective Probabilities¶

  • A probability of an outcome can be thought of in two ways:
    • One perspective: the frequency with which it will occur in repeated trials
    • Another perspective: the subjective degree of belief that it will occur (or has occurred)
  • Why use subjective priors?
    • In order to quantify a belief that is relevant to a decision
    • If the subject of your prediction was not selected randomly from the population

Adapted from UC Berkeley DATA 8 course materials.

This content is offered under a CC Attribution Non-Commercial Share Alike license.